New space/time tradeoffs for top-k document retrieval on sequences
Authors
Abstract
We address the problem of indexing a collection D = {T_1, T_2, ..., T_D} of D string documents of total length n, so that we can efficiently answer top-k queries: retrieve the k documents most relevant to a pattern P of length p given at query time. There exist linear-space data structures, that is, structures using O(n) words, that answer such queries in optimal O(p + k) time for an ample set of notions of relevance. However, linear space is still too much for large text collections. In this paper we explore how far the space/time tradeoff for this problem can be pushed. We obtain three results: (1) When relevance is measured as term frequency (the number of times P appears in a document T_i), an index occupying |CSA| + o(n) bits answers the query in time O(t_search(p) + k lg^2 k lg^(1+ε) n), where CSA is a compressed suffix array indexing D, t_search is its time to find the suffix array interval of P, and ε > 0 is any constant. (2) With the same measure of relevance, an index occupying |CSA| + n lg D + o(n lg σ + n lg D) bits answers the query in time O(t_search(p) + k lg* k), where lg* k is the iterated logarithm of k. (3) When the relevance depends only on the documents, an index occupying |CSA| + O(n lg lg n) bits answers the query in O(t_search(p) + k t_SA) time, where t_SA is the time the CSA needs to retrieve a suffix array cell. Along the way, we obtain some other results of independent interest.
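To make the query semantics concrete, the following is a minimal brute-force sketch of a top-k query under term-frequency relevance. It is not the compressed index described in the abstract: it scans every document at query time and uses no suffix array. The function names `count_occurrences` and `top_k_by_term_frequency`, the choice to count overlapping occurrences, and the tie-breaking by lower document index are all assumptions made for illustration.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, overlapping ones included."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted too.
        start = text.find(pattern, start + 1)
    return count


def top_k_by_term_frequency(documents, pattern, k):
    """Return up to k (doc_index, frequency) pairs, highest frequency first.

    Documents in which the pattern never occurs are omitted.
    Ties are broken by the lower document index (an illustrative choice).
    """
    hits = []
    for index, doc in enumerate(documents):
        tf = count_occurrences(doc, pattern)
        if tf > 0:
            hits.append((tf, index))
    # Select the k best by (descending frequency, ascending index).
    best = heapq.nsmallest(k, hits, key=lambda item: (-item[0], item[1]))
    return [(index, tf) for tf, index in best]


if __name__ == "__main__":
    docs = ["abracadabra", "banana", "aaaa", "xyz"]
    print(top_k_by_term_frequency(docs, "a", 2))
```

This baseline takes time proportional to the total collection length n on every query, which is the cost the indexes in the abstract avoid by first locating the suffix array interval of P and then reporting the k best documents from it.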
Related articles
Improved Compressed Indexes for Full-Text Document Retrieval
We give new space/time tradeoffs for compressed indexes that answer document retrieval queries on general sequences. On a collection of D documents of total length n, current approaches require at least |CSA| + O(n lg D / lg lg D) or 2|CSA| + o(n) bits of space, where CSA is a full-text index. Using monotone minimum perfect hash functions, we give new algorithms for document listing with frequenci...
Improved Skips for Faster Postings List Intersection
Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function, and query analyzer are the main components of an information retrieval system. Document retrieval systems are fundamentally based on Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...
Colored Range Queries and Document Retrieval
Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important one-dimensional colored range queries — colored range listing, colored range top-k queries and colored range counting — and, thus, new bounds fo...
Practical Top-K Document Retrieval in Reduced Space
Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k...
Journal: Theor. Comput. Sci.
Volume: 542
Publication year: 2014